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Abstract: Ilex asprella, which contains abundant a-amyrin type triterpenoid saponins, is 
an anti-influenza herbal drug widely used in south China. In this work, we first analysed 
the transcriptome of the /. asprella root using RNA-Seq, which provided a dataset for 
functional gene mining. mRNA was isolated from the total RNA of the /. asprella root and 
reverse-transcribed into cDNA. Then, the cDNA library was sequenced using an lUumina 
HiSeq™ 2000, which generated 55,028,452 clean reads. De novo assembly of these reads 
generated 51,865 unigenes, in which 39,269 unigenes were annotated (75.71% yield). 
According to the structures of the triterpenoid saponins of /. asprella, a putative biosynthetic 
pathway downstream of 2,3-oxidosqualene was proposed and candidate unigenes in the 
transcriptome data that were potentially involved in the pathway were screened using 
homology-based BLAST and phylogenetic analysis. Further amplification and flinctional 
analysis of these putative unigenes will provide insight into the biosynthesis of Ilex 
triterpenoid saponins. 

Kej^ords: Ilex asprella; triterpenoid saponins; biosynthesis; a-amyrin; RNA-Seq 
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1. Introduction 

Triterpenoid saponins are a class of widespread secondary metabolites in the plant kingdom. 
Chemical composition of triterpenoid saponins includes a triterpene moiety as the sapogenin and one 
or more attached sugar moieties such as glycosyl, glucuronyl or xylosyl. Triterpenoid saponins have 
drawn the attention of researchers because of their diverse bioactivities, including anti-inflammatory [1], 
anti-cancer [2], anti-microbial [3], insecticidal and anti-herbivore [4,5] activities. Owing to their 
significant pharmacological activities, plants rich in triterpenoid saponins are usually exploited as drug 
sources. However, the availability of triterpenoid saponins is hampered due to their low yield of crude 
drug extraction and difficulties in purification. Understanding the biosynthesis of triterpenoid saponins 
could help solve this problem. 

Ilex asprella, a traditional herbal drug widely used in Lingnan area of China, is a major component 
of some popular cooling beverages and anti-influenza remedies, and with an annual consumption of 
over 10,000 tons, has great economic value. The most characteristic constituents of /. asprella 
are triterpenoid saponins, which show anti-cancer [6] and anti-virus activity [7]. To date, over 
30 triterpenoid saponins have been isolated from the /. asprella leaves and roots (see Figure 1 and 
Table 1) [7-11]. These compounds can be classified into two main types: a-amyrin and P-amyrin. 
P-Amyrin, which is an oleanane, is a major configuration of pentacyclic triterpenoids, whereas 
a-amyrin, which is an ursane, is the isomer of P-amyrin but with a different location for C29 [12]. 
Interestingly, most of the triterpenoid saponins that were isolated from /. asprella roots were of the 
a-amyrin type (summarised in Table 1), except for one, which was of the P-amyrin type. 

Figure 1. Putative triterpenoid saponins biosynthetic pathway downstream of 
2,3-oxidosqualene in /. asprella. 
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Table 1. Main triterpenoid saponins isolated from the roots of/, asprella. 
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* The triterpene skeleton configurations are corresponded to Figure 1 . 



Over the past few years, the biosynthesis of triterpenoid saponins in some economically important 
plants, such as Glycine max [13] and Panax ginseng [14], has been studied. A biosynthetic pathway 
starting with the cycHsation of 2,3-oxidosqualene was suggested and involves three main steps: 

(i) cycUsation of 2,3-oxidosqualene catalysed by oxidosqualene cyclase (OSCs, EC 5.4.99.x); 

(ii) oxidative modification at various positions of the skeleton mediated by cytochromes P450 
(P450s, EC 1.14.X.X); and (iii) glycosylation of the decorated skeleton catalysed by family 1 uridine 
diphosphate glycosyltransferases (UGTs, EC 2.4.1.x). Accordingly, a hypothetical biosynthetic 
pathway of triterpenoid saponins in /. asprella is described in Figure 1. The biosynthetic pathway 
upstream of 2,3-oxidosqualene is believed to be the mevalonic acid (MVA) pathway in the cytosol, 
although evidence exists for crosstalk between the MVA and the methylerythritol phosphate (MEP) 
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pathways [15] (see Figure 2, which is adapted from the KEGG map00900 and modified according to 
the present study). 

Figure 2. Terpenoid backbone biosynthetic pathway. 
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The identification of genes involved in the biosynthetic pathway of terpenoid saponins has been 
achieved by using many different techniques, including the next-generation sequencing technology 
(NGS). A recently developed technique called RNA Sequencing (RNA-Seq) for transcriptome 
profiling using NGS techique has shown great potential for functional gene mining for non-model 
plants [16,17] and can help in the discovery of rare transcripts in the transcriptome owing to its great 
sequencing depth. Since no appropriate reference is available for the non-model plants, de novo 
assembly is the only option for sequence assembly [16]. Therefore, RNA-seq utilising lUumina 
next-generation sequencing was used for the transcriptomic study of the /. asprella root and the detection 
of candidate genes involved in the triterpenoid saponin biosynthetic pathway as presented in this study. 



2. Results and Discussion 



2.1. RNA-Seq Output, Sequence Assembly and Gene Annotation 

2.1.1. Transcriptome Sequencing Output and Sequence Assembly 

Next-generation sequencing was performed on RNA extracted from the /. asprella root and 
provided 55,028,452 high-quality (HQ) reads out from 58,670,910 raw reads (a yield of 93.79%). The 
Q20 and GC percentages were 98.08% and 46.34%), respectively. De novo assembly of these HQ reads 
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produced 110,049 contigs of 36,036,333 nucleotides (nt) and the average length of these contigs was 
327 nt, with an N50 of 540 nt. Further assembly of these contigs generated 51,865 unigenes; and 
the mean length and N50 of the unigenes were 685 and 1028 nt, respectively. Furthermore, the 
51,865 unigenes could be grouped into 16,517 distinct clusters and 35,348 distinct singletons, 
using homologous transcription cluster analysis. The distribution of contigs and unigenes is shown in 
Figure S 1 . 

2.1.2. Gene Expression Overview 

To investigate the expression levels of the sequencing data, the FPKM (Fragments per kilobase of 
exon model per million mapped fragments) values were applied to normalise and evaluate each 
unigene. Statistics of the distribution of the FPKM values, listed in Table 2, showed that the expression 
level of most unigenes was between 1 and 10. 



Table 2. FPKM values distribution. 
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2.1.3. Functional Annotation 

The 51,865 unigenes were successfully annotated through comparison with the sequences in the 
major public databases. In total, 39,269 unigenes were annotated to at least one database, which 
accounted for 75.71% (see Table 3). For Gene Ontology annotation, 29,375 unigenes were mapped to 
57 functional groups (see Figure S2), among which, 18,932 were involved in the "metabolic process". 
Of the 12,860 unigenes that were assigned to the COG database, 656 belonged to the cluster 
"secondary metabolites biosynthesis, transport and catabolism" (see Figure S3). The KEGG annotation 
profiled the biological pathways that are active in /. asprella and 20,752 unigenes were mapped to 
128 KEGG pathways. Moreover, 272 unigenes were assigned to five terpenoid-like biosjoithesis 
processes. One hundred and two unigenes (0.49%) mapped to "terpenoid backbone biosjoithesis", 
13 (0.06%) mapped to "Monoterpenoid biosynthesis", 77 (0.37%) mapped to "Diterpenoid 
biosjoithesis", 30 (0.14%) mapped to "Sesquiterpenoid and triterpenoid biosjoithesis" and 50 (0.24%) 
mapped to "Ubiquinone and other terpenoid-quinone biosynthesis". Based on the annotation results, 
the candidate genes related to terpenoid backbone and triterpenoid synthesis were identified and 
discussed in detail. 
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Table 3. Unigenes mapped to the public databases. 



Public database 
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Annotation percentage/% 
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2.2. Candidate Genes Involved in the Biosynthesis ofTriterpenoid Saponins 

2.2.1. Terpenoid Backbone Bios5aithesis 

Terpenoids are derived from C5 isoprene units through a "head-to-tail" connection. The conjunction 
of a different number of C5 isoprene units brings about various intermediates, such as IPP (C5 unit), 
DMAPP (C5 unit), GPP (CIO unit), and FPP (15 unit), which form the carbon skeletons of the 
different terpenoids. IPP, along with its isomer, DMAPP, are important intermediates in terpenoid 
backbone formation; both intermediates can be synthesized through the MVA pathway in the 
cj^oplasm, or the MEP pathway in the plastid. Genes encoding all of the essential enzjmies for both 
pathways were found in this transcriptome data, indicating that both pathways are active in the 
/. asprella roots. 

The MVA pathway is essential for the biosynthesis of sterols, sesquiterpenes and triterpenoids. 
Twenty-two unigenes in the /. asprella transcriptome, including four AACT genes, three HMGS 
genes, six HMGR genes, one MK gene, five PMK genes and three MDC genes (see Table SI), were 
identified to be involved in the MVA pathway. Among these enzymes, HMGR catalyses the 
conversion of HMG-CoA into MVA, which is an irreversible, two-step biochemical reaction that 
reduces the thioester group into a primary alcohol [18]. Therefore, HMGR is considered an important 
rate-limiting enzyme in the MVA pathway. In this study, the presence of six highly homologous genes 
implied that HMGR might be encoded by multiple genes in /. asprella. Together with other plant 
HMGRs, these six candidate unigenes were derived from a common ancestor (see Figure 3). 

Monoterpenes and diterpenes are synthesised through the MEP pathway, and 23 unigenes encoding 
enzymes involved in this pathway, including 13 DXS genes, four DXR genes, two MDS genes and 
one each of MCT, CMK, HDS and HDR (see Table S2), were found in the transcriptome. DXSs 
catalyse the formation of 1-deoxy-D-xylulose 5 -phosphate through the condensation of pjruvate and 
glyceraldehydes-3-phosphate. The DXS genes can be classified into three clades: the DXSl clade is 
involved in primary metabolism; the DXS2 clade is responsible for secondary terpenoid biosjaithesis; 
and the recently elucidated DXS3 clade is involved in the biosjoithesis of products that are essential for 
plant survival but expressed at a low level [19]. Further classification was predicted for these 13 genes 
with other defined plant DXSs using phylogenetic analysis (see Figure 4). The results showed that 
four unigenes, including CL2147.contigl, CL131 l.contig2 and CL2836.cotigl and 2, were grouped 
into the DXSl clade; unigene26662 was the only unigene grouped into the DXS3 clade; and the 
remaining eight unigenes were grouped into the DXS2 clade. 
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Figure 3, Phylogenetic tree of HMGRs. Arabidopsis thaliana AtHMGR (NM_127292), 
Cucumis sativus CsHMGR (XM_004170850), Cydocarya paliurus CpHMGR (EU296534), 
G. max GmHMGR (XM_003547838), Glycyrrhiza uralensis GuHMGR (GQ845405), 
Medicago truncatula MtHMGR (XM_003629008), P. quinquefolius PqHMGR (FJ755158), 
Populus trichocarpa PtHMGR (XM_0023 13533), Solanum tuberosum StHMGR 
(NM_001288532), Taxus x media TxmHMGR (AY277740), Outgroup: Danio rerio 
DrHMGR (NM_001014292). 
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Both MVA and ME? pathways produce the C5 unit IPP, which can be transformed into its isomer, 
DMAPP, by IDI (Isopentenyl diphosphate isomerase). Meanwhile, IPP and DMAPP are assembled 
into GPP, FPP and GGPP by a series of prenyl transferases, including GPPS, FPPS and GGPPS. FPP 
is an important intermediate of triterpenoid biosynthesis. Two units of FPP join in a "tail-to-tail" 
fashion, catalysed by squalene synthase (SS), to yield the hydrocarbon squalene. Subsequently, 
squalene is oxidised by squalene monooxygenase (SM) with the cofactors O2 and NADPH to give 
rise to another important precursor, 2,3-oxidosqualene. We found two IDI genes (Unigene3767, 
Unigene3833), three GPS genes (Unigene837 and CL7170.Contigl and 2), four FPPS genes 
(CL2187.Contigl to 4), seven GGPPS genes (CL4542.Contigl, CL6970.Contigl, Unigenel0539, 
Unigenell301, Unigenel7743, Unigene27823 and Unigene8778), three SS genes (CL3649.Contigl to 3) 
and seven SM genes (CL3649.Contigl to 3, Unigenel4310, Unigenel5274, Unigenel8579 and 
Unigenel988) in the /. asprella transcriptome (see Table S3). 
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Figure 4. Phylogenetic tree of DXSs. This tree clearly shows that the distribution of the 
/. asprella sequences throughout the three clades of the tree. Aquilaria sinensis AsDXS3 
(JX860325), A. thaliana AtDXS (NM_1 17647), AtDXSl (NM_1 13045), AtDXS3 
(NM_121176), Catharanthus roseus CrDXS2 (DQ848672), Ginko biloba GbDXS 
(AY505128), G. max GmDXSl (NM_00 1249 141), Hevea brasiliensis HbDXS2 
(DQ473433), M truncatula MtDXSl (AJ430047), Salvia miltiorrhiza SmDXS2 
(FJ643618), Vitis vinifera VvDXS (XM_002266889), Zea mays ZmDXSl 
(NM OO 1164333), Outgroup: Perkinsus marinus PmDXS (AB284361). 
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2.2.2. Amyrin Synthaes 

As previously described, OSCs catalyse the cyclisation of 2,3-oxidosqualene to form a variety of 
triterpene skeletons [20], including phytosterol, dammarane, lupane and oleane (P-amyrin) [21]. This 
step is thus a critical branching point for phytosterol and triterpenoid biosynthesis. Over fifty different 
OSCs have been cloned from various plant species [22]. Among those OSCs, amyrin synthase 
catalyses the cyclisation of 2,3-oxidosqualene into a-amyrin and P-amyrin, resulting in a 
chair-chair-chair-boat conformation. Nine unigenes were identified to be amyrin synthase genes in this 
study (see Table S4). In addition, phylogenetic analysis of these nine unigenes indicated that they 
exhibit close homologous relationships with P-amyrin synthases and multiflinctional amyrin synthases 
(see Figure 5). This prediction was supported by the presence of a-amyrin and P-amyrin type 
triterpenoids in the /. asprella roots. Among the nine candidates, CL3079.Contigl and CL481.Contigl 
were found to contain a full-length cDNA, including start and stop codons and a polyA signal, using 
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the online tool GENSCAN. To confirm the gene sequences, primers were designed to anneal around 
the predicted start and stop codons of the two genes, resulting in an expected length of approximately 
2500 base pairs (bp). Both genes were successfully amplified from cDNA generated from a different 
/. asprella root sample (see Figure 6, the sequencing result is shown in Figure S4) and will be 
characterised in the future. 

Figure 5. Phylogenetic tree of OSCs. The tree illustrates the likely gene flinction of 
19 characterized OSC genes and nine candidate OSC unigenes found in this study. 
A. thaliana AtOSC-muhi (NM_106497), A. thaliana AtOSC-ls (NM_1 14382), 
Artemisia annua AaOSC-ba (EU330197), Bruguiera gymnorhiza BgOSC-lu (AB289586), 
C. roseus CrOSC-multi (JN991165), Cucurbita pepo CpOSC-ls (ABl 16239), G. max 
GmOSC-ba (AY095999), G. glabra GgOSC-lu (ABl 16228), Kandelia candel 
KcOSC-multi (AB257507), Lotus japonicas LjOSC-lu (ABl 8 1245), Luffa aegyptiaca 
LaOSC-ls (AB033335), Malus x domestica MxdOSC-multi (FJ032006), M. truncatula 
MtOSC-ba (AJ430607), Olea europaea OeOSC-multi (AB291240), P. ginseng PgOSC-ba 
(AB009030), P. ginseng PgOSC-lu (AB009031), Solanum lycopersicum SlOSC-multi 
(HQ266580), Taraxacum officinale ToOSC-ls (AB025345), Outgroup: Mus musculus 
MmOSC (NM_146006); ba is for P-amyrin synthase, multi is for multifunctional OSC 
gene. Is is for lanosterol synthase, lu is for lupeol synthase. 
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Figure 6. PGR products of amplified CL3079.contigl (A) and CL481.contigl (B). 
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The gene on CL3079.Contigl (designated as laASl) has a length of 2271 bp encoding 756 amino 
acids, showing 82% amino acid sequence identity to the mixed amjTin synthase CrAS from 
Catharanthus roseus. In contrast, the gene on CL481.Contigl (designated as laASI) is 2274 bp long 
encoding 757 amino acids, showing 86% amino acid sequence identity (Supplementary Figure S5) to 
the P-amjTin synthase AeAS from Aralia elata [23]. The QW repeat, DCTAE motif, as well as the 
MWCYCR motif, is present in both laASl and IaAS2 [24—26]. Multiple alignment analysis of laASl 
and IaAS2 with 19 ASs randomly selected from Genbank highlights some amino acid residues 
conserved in mixed AS. Among these, Glu46, the PVRXXE motif, Asn 157, Thr 263, Ile290, Leu402, 
Ile614 and Thr677 in laASl are likely candidates responsible for multiproduct nature exhibited by 
mixed AS, as these residues are located just near the QW repeat or conserved motif. 

2.2.3. P450s 



Following the formation of amyrin, functional groups, such as hydroxyl and carboxyl, are 
infroduced at different positions of the backbone, and this reaction is catalysed by the P450s 
(EC 1.14.X.X). This step contributes to increasing structural diversity [27,28]. As in the hypothesis 
described in Figure 1, CYP450s of C19-oxidase and C28-oxidase are essential for modifying a-amyrin 
during the formation of pomolic acid, which is an important intermediate compound. Dehydration of 
pomolic acid yields a second double bond, in addition to the first bond at C12-C13. This reaction 
might be catalysed spontaneously or by a dehydrase(s). Through either method, the dehydration of 
pomolic acid gives rise to randialic acid, 19-dehydrousolic acid and 19(29)-dehydrousolic acid, which 
are precursors of skeletons A, B and C, respectively. Moreover, isomerisation of these three precursors 
might take place spontaneously or mediated by isomerase. Further, C23 -oxidase and C24-oxidase are 
necessary to form skeleton D. In brief, it is expected that these oxidisations are catalysed by P450s. 
The P450s is one of the largest and most diverse gene families in plants [29,30]. To date, only a few 
P450s have been identified as involved in triterpenoid biosynthesis. CYP93E1 from G. max was the 
first one characterised, and catalyses the C24-hydroxylation of P-amyrin [31]. CYP93E3 in liquorice 
exhibits a similar catalytic activity as CYP93E1 [32]; and CYP716AI2 in M. truncatula and 
CYP716AL1 in C. roseus were characterised as multiflmctional enzymes with P-amyrin 28-oxidase, 
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a-amyrin 28-oxidase and lupeol 28-oxidase activities [27,33]. In the transcriptomic data, 269 unigenes 
were annotated to be P450s. BLASTp analysis using the above-mentioned characterised genes as 
queries against the transcriptome narrowed down the potential unigene number. Ten unigenes 
(CL1221.Contigl to 3, CL3010.Contigl to 4, unigenel0591, unigene23155 and unigene25510) were 
highly homologous to CYP716A12 and CYP716AL1, with peptide sequence identities of more than 
55%, which implies that they may belong to the same subfamily, according to the assignment of P450s 
to families and subfamilies developed by Nelson [34]. In addition, CL410.Contigl exhibited 49.22% 
and 51.07%) sequence identity to CYP93E1 and CYP93E3, respectively, which was coincident with the 
KEGG annotation. These P450s are shown in Table S5. 

2.2.4. UGTs 

UGTs catalyse the transfer of glycosyl residues to the precursors that are decorated by P450s. The 
introduction of a glycosyl moiety to a triterpene increases its aqueous solubility, thus making it a 
triterpenoid saponin. UGTs catalyse the glucosylation of C3-hydroxyl and C28-carboxyl, which is 
essential to complete the triterpenoid saponin biosjaithetic pathway in the /. asprella root. Like the 
P450s, UGTs constitute a large and diverse gene family. Sequences belonging to the same family and 
subfamily exhibit amino acid sequences identity >40% and >60%, respectively [35]. In the cDNA 
library of /. asprella, 335 unigenes were found to encode UGTs. Among them, five unigenes 
(CL679.Contig3, Unigene5668, Unigene29448, Unigene26225 and Unigene3060) exhibited high 
homology to UGT73C10 and UGT73C12 in Barbarea vulgaris (see Table S6), whose bioactivity is to 
catalyse the 3-(9-glucosylation of oleanolic acid [36]; and one unigene (CL1465.Contig3) was 
homologous to UGT73F3 in M. truncatula, which catalyses the glucosylation of the C28-carboxy 
group of oleanane sapogenins [37]. Moreover, one unigene (Unigene82) was found to have the highest 
identity of 50.00% to the amino acid sequence of UGT74M1 in Saponaria vaccaria [38], which 
preferentially catalyses 28-0-glucosylation of oleanane-type sapogenins. 

2.3. Discussion 

Recently, interest in the biosynthesis of triterpenes has gradually increased because of their 
economical and scientific importance. A number of OSCs involved in the formation of triterpene 
carbon skeletons have been identified and characterized [39]. While mono functional P-amyrin 
synthases and lupeol synthases were found, all the a-amyrin synthases identified so far are 
multifunctional and yield more than one product [33]. Many species of the Ilex genus of plants are rich 
in triterpenoid saponins, mostly of the a-amyrin type. The transcriptomic analysis of/, asprella has 
revealed a few AS candidate genes, and a close investigation into these candidates and their comparison 
with previously characterised AS genes would provide important knowledge of this gene family. 

Unlike OSCs, the identification of new CYP450s and UGTs involved in the biosjoithesis of 
triterpenoid saponins is beset with difficulties owing to the poor relationship between gene homology 
and functions of these two gene families. In this study, gene annotation provided a great number of 
putative CYP450s and UGTs. The candidate number was narrowed down to a few homologous 
unigenes by applying direct, homology-based screening of characterised genes. However, it is 
unknown whether these candidates are actually involved in the biosjoithesis of triterpenoid saponins. 
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Therefore, additional strategies should be engaged to identify credible candidate CYP450s and UGTs. The 
combination of exlicitor-induced expression regulation and co-expression analysis with OSC [33,37,40] 
would contribute to identification of the targeting CYP450s and UGTs in /. aprella. 

Triterpenoid saponins were isolated from various tissues of/, aprella like roots and leaves, but this 
study is restricted to root tissue. A more thorough analysis of different plant tissues coupled with 
metabolomic data would help in building a global picture of Bex triterpenoid biosynthesis and perhaps 
find novel candidate genes which would otherwise be difficult with sequence-homology-based searches. 

3. Experimental Section 

3.1. Plant Material and RNA Preparation 

Two-year-old, potted /. asprella was collected from the Planting Base in Meizhou, Guangdong 
province, China. The /. asprella root was flushed under running tap water to remove soil and other 
attachments. After quick drying with bibulous papers, the root tissue was cut into approximately 
1-mm-thick segments, snap frozen in liquid nitrogen and stored at -80 °C until further processing. 
Total RNA of the root was isolated using RNAiso Plus and RNAiso-mate for Plant Tissue (Takara, 
Dalian, China) following the product manual. The integrity, purity and concentration of the total RNA 
were analysed using agarose gel electrophoresis and ultraviolet spectroscopy. 

3.2. cDNA Synthesis and Sequencing 

Poly(A) mRNA was isolated from total RNA using Oligo (dT) beads, and then broken into short 
fragments using fragmentation buffer. Using these fragments as templates, random hexamer-primers 
were used to synthesise the first-stand cDNA. The second-strand cDNA was synthesised using GEX 
Second Strand buffer (10 |iL), 25 mmol-L"^ dNTPs (1.2 |iL), NRaseH (1 \\L) and DNA polymerase I 
(5 |j,L). The short fragments of double-stranded cDNA were purified using the QiaQuick PCR 
extraction kit (Qiagen, Duesseldorf, NW, Germany) and eltuted with elution buffer for end repairing 
and adding of poly(A). Next, the short fragments were connected to sequencing adapters and purified 
by agarose gel electrophoresis. Suitable fragments were selected as templates for PCR amplification. 
Finally, the cDNA library was sequenced using an lUumina HiSeq™2000 (lUunima, San Diego, 
CA, USA). 

3.3. Sequence Quality Control and Cleaning 

The raw sequencing data of the cDNA library was transformed by base calling into raw reads, and 
stored in the FASTQ format. The sequence quality (sQ) was evaluated using the following formula: 
sQ = -10-lgE (E is the sequencing error rate). Raw reads (i) with a 3' adaptor; (ii) with more than 5% 
uncertain nucleotides; and (iii) of low quality (Sq < 10 bases counted for more than 20% of the reads) 
were filtered out to generate clean reads. The clean reads were then used for flirther analysis and 
uploaded to the Sequence Read Archive (SRA) at NCBI with the accession number SRP035767. 



Int. J. Mol. Sci. 2014, 15 



5982 



3.4. Sequence Assembly 

De novo assembly was performed using the short reads assembling program Trinity [41], which 
combines reads with a certain length of overlap to form longer fragments called contigs. Then, the 
reads were mapped back to these contigs. Furthermore, the contigs were assembled using Trinity to 
generate sequences that could not be extended on either end, which are defined as unigenes. Using 
homologous transcription cluster analysis, the unigenes were classified into two groups: unigenes with 
homologies higher than 70% were designed as clusters (initialled with CL, numbered with the gene 
family), while sole unigenes were designed as singletons (initialled with Unigene). 

The assembled unique sequences were aUgned to protein databases in the following order: 
Non-redundant (Nr) protein database, the Swiss-Prot protein database, the Kyoto Encyclopedia of 
Genes and Genomes (KEGG) database and the Cluster of Orthologous Groups of proteins (COG) 
database, by applying BLASTx (E value threshold set at 10"^). Sequence hits in a former database 
would not advance to a search against the next database. The CDS of the unigenes were extracted 
and franslated into peptide sequences. Based on the BLAST results, the sequence direction was 
determined. For unigenes with uncertain direction after BLAST analysis, a statistical Hidden Markov 
Model (HMM) program, ESTscan [42], was introduced to help determine the sequence direction. 
Sequences whose directions could not be predicted using the ESTscan program were assigned their 
initial assembled sequence direction. 

3.5. Gene Annotation 

First, the unigene sequences were aligned by applying BLASTx against protein databases, including 
Nr, Swiss-Prot, KEGG and COG (E value threshold set at 10"^), and then using BLASTn against the 
Nucleotide (Nt) database (E value threshold set at 10"^). Proteins with the highest sequence similarity 
with the given unigenes along with their protein functional annotations were refrieved. Gene Ontology 
(GO) fijnctional annotation of the unigenes was obtained along with the Nr annotation. The noted 
unigenes were assigned to GO categories for Molecular Function, Biology Process and Cellular 
Component using the Blast2GO program [43]. To portray the distribution of the fimctions of these 
noted unigenes, the WEGO program [44] was applied to classify the GO terms. 

3.6. Gene Expression Analysis 

The gene expression levels were analyzed by quantifying the read abundance observed. Paired-end 
reads mapped to a common contig were normalized by calculating FPKM values [45] for each contig 
by the formula followed: FPKM = (1,000,000- Q/(iV-Z- 1000), where C is the number of fragments that 
uniquely aligned to an objective gene, N is the total number of fragments that uniquely aligned to all 
genes, and L is the number of bases in the objective gene. 

3. 7. Homology-Based Gene Discovery 

Protein sequences of 4 characterized CYP450s and 4 UGTs, which were reported to be involved in 
the triterpenoid saponin biosjaithetic pathway, were selected as objectives to run BLASTp analysis 
against the protein sequences that were translated from the raw reads data (E value threshold set at 10"^). 
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3.8. Phylogenetic Analysis 

MEGA 6.05 was applied to perform the phylogenetic analysis of the nucleotide sequences of the 
target genes using the Maximum Likelihood method [46]. The reliability of all trees was evaluated 
using the bootstrap re-sampling method with 1000 replications. 

3.9. Amplification of 2 OSC Genes 

Total RNA was extracted from a different plant using the previously described method. Then, 
cDNA was synthesised using the PrimeScript™ RT-PCR Kit (TAKARA, Dalian, China) following 
the product manual. The primers for amplifying the complete CDS of CL3079.contigl and 
CL481.contigl were designed as followed: 

for CL3079.contigl, Forward: TCTCTCTGTGTTTATGGGTA (5'^3') and 
reverse: GAACACTGAAGGATACAAAC (5'^3'). 

for CL481.contigl, Forward: GCCACAGTTATCTTCGTATT (5'^3'), and 
reverse: CATACTTCAAGGACCTCAAA (5'^3'). 

The Polymerase Chain Reaction (PCR) contained 10 |j,L of PrimeSTAR Max DNA Polymerase 
(TAKARA, Dalian, China), 0.4 ^iL of each primer (10 mM), 1 ^iL of cDNA from the /. asprella root 
and water up to 20 |xL. The amplification reaction was performed using the following temperature 
procedure: 98 °C for 2 min; 30 cycles of 98 °C for 10 s, 50 °C for 15 s, 72 °C for 15 s; and 72 °C for 
5 min. Subsequently, 5 |xL of the PCR product was mixed with 1 |xL of 6x loading buffer, visualized 
using 1% agarose gel electrophoresis with Goldview dye (120 V for 12 min). Nucleotide sequencing 
was carried out by BGI Co., Lit (BGI, Beijing, China). 

4. Conclusions 

The transcriptome of the /. asprella root was obtained using RNA-Seq, resulting in many unigenes. 
The unigene dataset that was generated in this study provides a significant resource for fijrther 
molecular studies of /. asprella, especially for characterising candidate genes in the biosjoithetic 
pathways of triterpenoid saponins. Using appropriate approaches, a series of candidate genes were 
identified and were consequently analysed for expression patterns and phylogenetic relationships. 
A comprehensive bioinformatics analysis contributed to a better understanding of the candidate genes 
and to a reliable design for further research. The putative genes identified in /. asprella will be cloned 
and characterised in further studies. 
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